README for Human-Annotated Comments Dataset
Submitted to ACL 2026 Findings | Harvard Dataverse

Overview
--------
This dataset contains 3,289 human-annotated comments (rows) across 46 columns, sourced from Philippine Facebook pages. Each comment is labeled for political relevance, civility, and sentiment, and for specific characteristics such as uncivil language, stereotyping, or ideological extremism. It provides a resource for studying political discourse in multilingual Philippine social media.

This resource is made available in conjunction with a paper submitted to ACL 2026 Findings:

    Herrera, M. J., Jaidka, K., and Luyt, B. Misinformation Commands Attention:
    An English-Tagalog Dataset of Political Discussions on Philippine Facebook Pages.
    Submitted to Findings of the Association for Computational Linguistics: ACL 2026.

Dataset Structure
-----------------
Number of Rows: 3,289
Number of Columns: 46
Language(s): English and Filipino (Tagalog)
Source: Facebook political pages (Philippines)

Key Columns:

1. General Metadata:
   - Ticket ID: Unique identifier for each comment.
   - Post Date: Date and time when the comment was posted.
   - Site Name, Site URL: Information about the platform (e.g., Facebook).
   - Channel Type, Channel Country, Channel Name: Details about the channel from which the comment originated.
   - Category, Subject Name, Title: Contextual data about the post or discussion topic.

2. Engagement Details:
   - Influence Score: Score indicating the influence or visibility of the comment.
   - Sentiment Score: Categorical sentiment label (e.g., Neutral, Very Negative) rather than a numeric score.

3. Comment Analysis (Human-Annotated):
   - I believe this comment is relevant to politics: Label indicating political relevance; values may include a confidence qualifier (e.g., "Yes, I am confident").
   - I believe this comment is uncivil: Binary label indicating incivility.
   - Specific content characteristics:
     - Contains obscene language/vulgarity
     - Contains insulting language/name calling
     - Contains ideologically extreme language
     - Contains stereotyping
     - Exaggerated argument
     - Do you think this comment is misleading?

4. Textual Content:
   - Content: Full text of the comment (English and/or Filipino).
   - Additional comments: Annotations or notes by human annotators.

5. Temporal Data:
   - Fields such as YearKey, MonthName, WeekOfYear, and HourKey for timestamp analysis.
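
The sketch below shows one way to parse the Post Date field and derive comparable temporal values. The format string is inferred from the sample rows in this README (e.g., "Mon 02-Jan-2023 13:27:50"), and the output keys are illustrative names; how YearKey, MonthName, WeekOfYear, and HourKey were actually computed is not documented here, so treat the ISO week in particular as an assumption.

```python
from datetime import datetime

# Assumption: format inferred from the README sample rows,
# e.g. "Mon 02-Jan-2023 13:27:50". Adjust if the file differs.
POST_DATE_FORMAT = "%a %d-%b-%Y %H:%M:%S"


def temporal_fields(post_date: str) -> dict:
    """Parse a Post Date string and derive year, month name, ISO week, and hour."""
    ts = datetime.strptime(post_date.strip(), POST_DATE_FORMAT)
    return {
        "year": ts.year,
        "month_name": ts.strftime("%B"),
        # ISO week number; the dataset's WeekOfYear may use another convention.
        "iso_week": ts.isocalendar()[1],
        "hour": ts.hour,
    }


fields = temporal_fields("Mon 02-Jan-2023 13:27:50")
print(fields)
```

Comparing the derived values against the dataset's own temporal columns on a few rows is a quick way to confirm which week-numbering convention was used.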

Usage and Applications
----------------------
This dataset is particularly suited to NLP and computational social science research. Intended applications include:

1. Toxicity and Civility Detection: Developing models to detect uncivil or harmful language in code-switched Philippine social media text.
2. Sentiment Analysis: Training models to classify sentiment in English-Tagalog comments.
3. Political Discourse Analysis: Studying political relevance and trends in online discussions.
4. Annotation Quality Research: Benchmarking human annotation reliability for multilingual political content.

Sample Data
-----------
| Ticket ID  | Post Date                    | Content                        | Relevant to Politics | Uncivil | Contains Insults | Contains Ideological Extremes |
|------------|------------------------------|--------------------------------|----------------------|---------|------------------|-------------------------------|
| 9564933693 | Mon 02-Jan-2023 13:27:50     | salamat tatay digong           | Yes, I am confident  | No      | No               | No                            |
| 9564932912 | Mon 02-Jan-2023 14:04:04     | Grace poe -- dilangaw yarn     | Yes, I am confident  | Yes     | Yes              | Yes                           |
| 9564934232 | Mon 02-Jan-2023 16:04:14     | Sawsaw n nmn si grace poe..    | Yes, I am confident  | Yes     | Yes              | Yes                           |

Important Notes
---------------
- Data Privacy: Comments have been anonymized. Ensure compliance with applicable privacy laws (e.g., the GDPR and the Philippine Data Privacy Act of 2012) when using this dataset.

- Annotation Confidence: Annotations reflect human judgment and may include subjective interpretations. Inter-annotator agreement metrics are reported in the associated paper.

- Data Cleaning: Some fields contain missing values (e.g., the "Do you think this comment is misleading?" column) and may require preprocessing before analysis.
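
A quick audit of missing values can be sketched as follows. This assumes the data is available as a CSV export and that missing values appear as empty cells; neither is confirmed by this README. The inline rows and their ticket IDs (t1 to t3) are synthetic placeholders, not records from the dataset.

```python
import csv
import io

MISLEADING = "Do you think this comment is misleading?"

# Synthetic stand-in for the real file; IDs and values are placeholders.
SAMPLE_CSV = f"""Ticket ID,{MISLEADING}
t1,Yes
t2,
t3,No
"""


def missing_counts(rows, columns):
    """Count empty or absent cells per column."""
    counts = {col: 0 for col in columns}
    for row in rows:
        for col in columns:
            value = row.get(col)
            if value is None or not value.strip():
                counts[col] += 1
    return counts


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
counts = missing_counts(rows, ["Ticket ID", MISLEADING])

# Keep only rows where the misleading label is present.
complete = [row for row in rows if (row.get(MISLEADING) or "").strip()]

print(counts)
print(len(complete), "of", len(rows), "rows have a misleading label")
```

To run this against the real data, replace the io.StringIO wrapper with an open file handle. Whether to drop or retain rows with missing labels depends on the task, so the filter above is only one option.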

Citation
--------
If you use this dataset, please cite the associated ACL 2026 Findings paper:

    Herrera, M. J., Jaidka, K., and Luyt, B. Misinformation Commands Attention:
    An English-Tagalog Dataset of Political Discussions on Philippine Facebook Pages.
    Submitted to Findings of the Association for Computational Linguistics: ACL 2026.

The dataset is hosted on Harvard Dataverse and should also be cited independently
per the repository's citation guidelines.